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Abstract.  Measurements  of  optical  turbulence  time  series  data  using  unattended  instruments 
over  long  time  intervals  inevitably  lead  to  data  drop-outs  or  degraded  signals.  We  present 
a  comparison  of  methods  using  both  Principal  Component  Analysis,  which  is  also  known 
as  the  Karhunen-Loeve  decomposition,  and  ARIMA  that  seek  to  correct  for  these  event- 
induced  and  mechanically-induced  signal  drop-outs  and  degradations.  We  report  on  the 
quality  of  the  correction  by  examining  the  Intrinsic  Mode  Functions  generated  by  Empirical 
Mode  Decomposition.  The  data  studied  are  optical  turbulence  parameter  time  series  from  a 
commercial  long  path  length  optical  anemometer/scintillometer,  measured  over  several  hundred 
metres  in  outdoor  environments. 


1.  Introduction 

How  many  data  points  can  be  missed  off  (set  to  zero)  from  a  1-D  discrete,  regularly  spaced  time 
series  record  and  still  be  recoverable?  Such  a  question  is  prompted  by  many  situations,  often 
associated  to  the  automated  collection  of  data.  We  are  motivated  to  address  this  question  by  our 
experimental  field  work  to  record  the  behaviour  of  the  strength  of  optical  turbulence  parameter, 
over  time  intervals  of  several  weeks,  under  different  climate  conditions  EH  121 H  HIE]. 

The  data  of  interest  to  us  are  path  integrated  measures  of  C2  measured  across  the  visible  to 
near  infrared  region.  The  measurement  path  is  a  600  m  stretch  of  free  space,  approximately  1.5 
nr  altitude  above  sea  water  in  the  Caribbean.  The  probe  beam  originated  from  an  LED  centered 
on  0.9  gm;  the  instrumentation  are  two  identical  Optical  Scientific  Inc.  LOA-  oo43  systems, 
which  employ  aperture  averaging  to  estimate  the  value  of  C2.  The  LOA-004s  are  generally 
left  unattended  over  the  course  of  up  to  a  few  days  during  field  operation,  which  mean  that 
they  are  susceptible  to  spurious  events  which  result  in  gaps  in  the  data  record.  Although  for 
some  types  of  data  analysis  techniques,  data  gaps  of  a  limited  size  can  be  tolerated,  this  is  not 
universal.  Moreover,  there  exists  a  new  class  of  very  powerful  techniques  based  on  Empirical 
Mode  Decomposition  Earn  that  are  exceedingly  sensitive  to  lossy  datasets.  It  is  for  this  reason 
we  are  exploring  different  methodologies  to  synthetically  fill  the  data  holes;  we  describe  the 
results  from  our  studies  using  Principal  Component  Analysis  in  this  paper. 

1  see  http://www.opticalscientific.com  or  email  info@opticalscientific.com. 
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2.  Karhunen— Loeve  or  Principal  Component  Analysis 

The  principal  components  of  any  ensemble  can  be  used  to  identify  the  members  of  that  ensemble. 
This  idea  forms  the  foundation  of  face  recognition  and  tracking  through  eigenfaces  (see,  for 
example,  Turk  and  Pentland  1991  |8j).  We  may  extend  this  method  to  reconstruct  the  missing 
data  for  any  data  record  under  given  restrictions.  The  key  point  is  that  the  gappy  data  record 
must  have  the  same,  or  similar,  salient  features  as  all  the  members  of  the  ensemble. 

The  principal  components  are  the  eigenvectors  of  the  covariance  matrix  of  the  data  and 
represent  the  features  of  the  dataset.  Provided  that  a  reference  library  can  be  created,  each 
member  of  the  library  will  contribute  to  each  eigenvector,  more  or  less.  As  such,  each  member 
can  be  exactly  represented  by  a  linear  combination  of  eigenvectors.  Any  similar  data  record 
external  to  the  library  will  also  be  represented  by  a  linear  combination  of  eigenvectors,  within  a 
margin  of  error. 

The  first  step  for  the  filling  procedure  must  therefore  be  to  define  the  reference  library. 
We  may  do  so  by  collecting  a  family  of  turbulence  data  series  that  share  certain  specific 
characteristics;  a  difficult  task  since  the  definition  of  such  is  an  open  question.  Moreover, 
since  the  mean  value  of  the  family  of  reference  data  plays  a  key  part,  all  the  members  of  the 
library  would  require  normalisation.  Again  how  to  do  this  is  an  open  question.  Alternatively  we 
may  use  the  neighbouring  record  around  a  data  hole,  sectioning  this  information  to  provide  the 
ensemble  members.  We  opt  for  the  latter  technique  since  the  record  pre  and  post  the  data  hole 
(within  a  certain  time  interval)  ought  to  be  similar  in  nature  to  the  missing  data.  The  mean 
value  of  this  type  of  library  would  probably  not  differ  greatly  from  the  mean  of  the  missing  data, 
so  normalisation  would  not  be  so  crucial. 

How  does  one  determine  the  principal  components  of  a  reference  library?  Following  Sirovich 
and  Kirby  [9 j ,  let  the  M  members  (each  of  length  N )  of  the  reference  ensemble  be  {pn\-  Thus, 
the  average  data  record  of  this  ensemble  will  be 
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It  is  very  reasonable  to  assume  that  departures  from  the  mean  record  will  provide  an  efficient 
procedure  for  extracting  the  primary  features  of  the  data.  Therefore,  we  define 


4*71  —  Pn  P 


(2) 


Now,  if  we  consider  the  dyadic  matrix 


M 

C=Y,  Mn  =  AAT  (3) 

71=1 

where  each  term  of  the  sum  signifies  a  second  rank  tensor  product,  we  can  recognize  this  as  the 
ensemble  average  of  the  two  point  correlation  of  the  deviations  from  the  mean.  Here,  AT  is  the 
transpose  of  A. 

We  require  eigenvectors  un  of  the  matrix  AAT .  For  ensembles  whose  members  have  a  large 
number  of  points  N  >  M ,  matrix  AAT  issingular  and  its  order  cannot  exceed  M .  To  find  those 
eigenvectors  of  AAT  corresponding  to  nonzero  eigen  values,  Turk  and  Pentland  used  a  standard 
singular  value  decomposition  technique,  as  described  below. 

AT  Avn  =  pvn 
AA  Avn  —  pnAvn 
CAvn  =  pAvn 


(4) 


where  fin  are  the  eigenvalues.  This  deduction  can  be  equated  to 

Cun  =  nnun  (5) 

where  un  =  Avn.  Thus  AAT  and  AT A  have  the  same  eigenvalues  and  their  eigenvectors  are 
related  through  un  =  Avn,  provided  that  ||itn||  =  1.  The  treatment  described  is  recognizable  as 
the  Karhunen-Loeve  (KL)  method  [lOj. 

The  implication  is  that  a  dataset  <fi'  can  be  obtained  from  a  limited  summation 

M 

(t>  ~  y  (  dnun  (6) 

n= 1 

where  the  coefficients  an  are  obtained  through  the  inner  product 

O-n  —  (0  j  ^n)  (7) 

We  emphasise  that  4>'  is  not  considered  to  be  part  of  the  { 4>n },  although  it  is  similar  in  features. 

3.  Proof  of  Concept 

To  demonstrate  the  validity  of  the  assertion  of  the  previous  section,  we  take  a  perfect  data 
record  of  measurements  over  a  7  hour  period  starting  from  midnight,  smoothed  by  a  forward 
moving  rolling  average  of  5  minutes’  interval  (60  data  points).  The  data  contain  2492  points  in 
total.  We  split  up  the  test  data  into  21  sections,  where  all  but  1  are  members  of  the  reference 
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Figure  1.  The  test  data,  split  into  21  sections.  The  data  are  padded  before  division  with  a  set 
of  points  taken  from  the  tail  of  the  time  series  and  mirrored  outward. 


library,  as  shown  in  Fig.  |T|  The  exclusive  section  is  set  to  zero,  and  the  algorithm  described 
in  Sec.  [2]  is  employed  on  the  20  library  elements.  The  reconstructions  are  created  from  the  KL 
coeffiecients  of  the  eigenvectors  equivalent  to  the  sections  adjacent  to  the  missing  data.  Hence 
we  will  talk  of  a  prior  and  posterior  reconstruction  meaning  e.g.  for  omitted  section  5,  we  use  for 
the  prior  the  KL  coefficient  equivalent  to  section  4  and  for  the  posterior  reconstruction  we  use 
the  coefficient  equivalent  to  section  6  of  the  test  data  set.  Note  that  this  does  not  reconstruct 
those  sections,  since  the  eigenvectors  are  generated  from  the  entire  reference  library. 


The  reconstructions  shown  in  the  left  hand  set  of  Fig.  [2]  represent  the  best  level  of  error, 
while  the  right  hand  set  shows  the  worst.  It  is  clear  that  the  greater  the  difference  between 
the  (masked  off)  original  data  section  and  its  neighbours,  the  poorer  the  reconstruction  will  be. 
Nevertheless,  the  reconstruction  errors  for  the  full  set  of  test  data  are  all  1  order  of  magnitude 
less  than  the  mean  value  of  the  reconstruction  and  the  original  data  segment.  Evidently  the  end 
points  have  not  been  synthesised  to  be  continuous  with  the  adjacent  segments  of  the  reference 
library  signal,  as  can  be  seen  from  the  error.  A  continuity  condition  in  terms  of  the  both  the 
function  and  its  derivative  has  to  be  imposed  on  both  ends  of  the  reconstructed  segment  in 
order  to  achieve  smoothness.  The  most  effective  way  to  do  so  is  currently  being  investigated. 
The  eigenvalues  determined  through  this  method  show  a  similar  distribution  in  both  cases, 
with  minor  variations  only  at  the  upper  end  of  the  spectrum,  implying  that  the  KL  differences 
between  best  and  worst  section  for  reconstruction  is  not  very  large. 

3.1.  KL  and  EMD 

In  the  absence  of  a  workable  continuity  condition,  we  present  here  the  effects  of  crudely  patching 
the  data  hole  with  reconstructions  determined  from  the  prior  and  posterior  terms  with  respect 
to  the  gap. 

We  apply  the  Empirical  Mode  Decomposition  (EMD)  algorithm  [2]  to  the  reconstructions. 
EMD  is  a  novel  adaptive  method  for  separating  a  nonlinear  time  series  into  components  based 
on  instantaneous  frequency.  Basically  it  acts  as  a  dyadic  filter  bank  [7];  the  set  for  the  best 
reconstruction  case  are  illustrated  in  Figs.  EH  We  refer  to  these  sets  as  EMDl  and  EMDr. 
The  original  data’s  intrinsic  mode  functions  (IMFs)  and  residuals  (set  EM  Do)  are  also  shown 
and  for  comparison,  we  present  the  effect  of  a  simple  minded  linear  interpolation  between  the 
endpoints  of  the  known  data  on  the  IMFs. 

Upon  visual  inspection,  we  see  that  the  original  data  generates  9  components:  8  intrinsic 
modes  and  1  residual  (the  stopping  criterion  for  our  EMD  implementation  is  the  same  as  in 
Huang  et  al  0)-  On  the  other  hand,  the  interpolated  data  have  only  8  components.  Numbering 
the  IMFs  from  highest  instantaneous  frequency  to  lowest,  starting  from  IMF  1,  we  can  see  by 
inspection  that  both  EMDl  and  EMDr  are  strongly  similar  to  EMDo  in  IMFs  4,5,6  and  7. 
The  differences  appear  in  the  higher  frequency  components,  due  to  the  discontinuity  between 
the  inserted  segment  and  the  unadulterated  data.  The  endpoint  discontinuities  evidently  modify 
the  variances  of  IMFs  1,2  and  3,  although  it  seems  that  they  are  only  affected  in  the  area  local 
to  the  discontinuity,  per  IMF. 

By  way  of  comparison,  a  linear  interpolant  between  the  edges  of  the  known  data  show  that 
there  is  contamination  all  through  the  IMFs.  It  is  so  strong  that  IMF  7,  which  in  the  other  sets 
clearly  distinguishes  the  baseline  rise  and  fall  of  the  turbulence  over  the  interval  under  study,  is 
unable  to  pick  out  a  clean  pedestal. 

4.  ARIMA 

Instead  of  decomposing  the  data  sequence  into  KL  components,  an  alternative  method  studied  is 
that  of  ARIMA  (Auto-Regressive  Integrated  Moving  Average)  [15].  This  technique’s  principle  is 
founded  on  taking  advantage  of  the  past  (first  moment)  behaviour  of  stochastic  data  to  predict 
the  future  characteristics.  It  is  a  well  known  technique  used  in  econometrix,  and  we  have 
employed  it  here  to  investigate  its  effects  on  EMD. 

An  ARIMA(p,  d,  q)  process  is  the  solution  of  the  following  equation 

cj)(B)(l-B)dXt='i/j(B)et  (8) 

where  BXt  =  W-i  is  the  backshift  operator,  e*  represents  a  white  noise  process,  and  are 
polynomials  of  degree  p  and  q  respectively.  For  mathematical  convenience  we  take  p  and  q  to 
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Figure  2.  Best  and  worst  reconstruction  result  using  PCA  for  (a)  section  5  and  (b)  section 
13  respectively.  The  layout  in  each  set  is  :  (Top  Left)  Segment  prior  to  the  selection.  (Top 
Middle)  The  original  selected  data.  (Top  Right)  Segment  after  selection.  (Centre  Left)  The 
difference  between  the  reconstruction  using  the  coefficient  of  eigenvector  related  to  the  prior 
segment  and  the  original.  The  value  shown  is  the  mean  absolute  difference  per  pixel.  (Centre 
Right)  The  (prior)  reconstruction.  (Bottom  Left)  The  difference  between  the  reconstruction 
using  the  coefficient  related  to  the  posterior  segment  and  the  original.  (Bottom  Middle)  The 
(posterior)  reconstruction.  (Bottom  Right)  The  eigenvalue  spectrum. 
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Figure  3.  (Top  Left)  The  IMFs  1  to  8  (top  to  bottom)  and  the  residual  trend  line  of  EMDo, 
generated  by  applying  Empirical  Mode  Decomposition  to  the  test  data.  (Top  Right)  The  IMFs 
1  to  7  and  the  residue  of  EMDl,  generated  from  data  using  the  PCA  reconstruction  method 
with  the  coefficient  of  the  prior  section.  (Bottom  Left)  The  IMFs  1  to  7  and  the  residue  of 
EMDr,  generated  from  data  using  the  PCA  reconstruction  method  of  the  posterior  section. 
(Bottom  Right)  The  IMFs  1  to  7  (top  to  bottom)  and  the  residue,  generated  from  data  with  a 
linear  interpolant  across  section  5  of  the  test  data. 


be  zero.  Fractional  d  generalises  the  differencing  parameter  between  data  points  to  obtain  the 
long  range  dependence.  In  general,  0  <  d  <  0.5,  where  longer  memory  is  represented  by  higher 
values  of  d. 

4-1.  A  RIM  A  and  EMD 

We  applied  an  ARIMA(0,d,0)  model  to  predict  the  behaviour  of  the  final  section  of  the  split 
test  data.  The  length  of  each  section  is  119  points  long;  by  basing  the  data  behaviour  on  the 
previous  2373  points  and  by  setting  the  difference  operator  to  be  d  =  0.5  we  generated  the  IMF 
results  shown  in  Fig.  [H 


Figure  4.  The  IMFs  and  the  residual  trend  line  of  EM  Do,  generated  by  applying  Empirical 
Mode  Decomposition  to  ARIMA(0,d  =  0.5,0). 

As  can  be  seen,  the  effect  of  the  ARIMA(0,  d,  0)  extrapolation  is  to  fill  in  the  higher  order 
IMFs  in  a  smooth  fashion  (apparently,  by  visual  inspection  ,  IMFs  5  to  8  are  smoothly  filled  at 
the  tail).  It  is  only  untrustworthy  when  considering  the  higher  frequency  components,  because 
ARIMA  considers  only  the  conditional  first  moment.  Note  that  the  number  of  IMFs  and  residual 
are  the  same  as  the  unadulterated  signal  EM  Do- 

5.  CONCLUSIONS 

We  have  discussed  a  data  hole  filling  method  for  stochastic  data,  a  necessary  step  to  be  able  to  use 
the  new  technique  of  Empirical  Mode  Decomposition  upon  a  time  series  record.  The  Karhunen- 
Loeve  eigenvectors  from  an  ensemble  of  neighbouring  sections  of  complete  data  around  a  data 
hole  can  be  used  to  reconstruct  the  missing  segment  to  a  reasonable  degree  of  accuracy,  at 
least  for  the  purposes  of  applying  EMD.  We  have  shown  that  the  edge  continuity  is  important, 
although  the  effect  of  discontinuities  is  not  universal  through  all  the  intrinsic  modes  of  the  data. 
The  quality  of  the  reconstruction  is  much  better  using  PCA  than  a  simplistic  linear  interpolant 
(or  merely  ignoring  the  data  gap).  We  compared  this  to  a  simplified  ARIMA(0,d,0)  model, 
which  performed  better  than  the  linear  interpolant  but  less  effectively  than  the  KL  algorithm, 
disregarding  edge  effects. 

In  summary,  when  a  data  hole  is  present,  merely  ignoring  the  data  hole  or  filling  it  with  a 
linear  interpolant  is  a  poor  technique  from  the  point  of  view  of  EMD.  The  result  of  doing  so 
leads  to  leakage  of  spurious  effects  both  laterally  in  time  (per  IMF)  and  longitudinally  across 
all  the  IMFs.  By  reconstructing  the  datagap  with  ARIMA(0,  d  =  0.5,0)  one  can  limit  the 
leakage  laterally  and  longitudinally,  although  the  lowest  IMFs  remain  strongly  contaminated 
with  artificial  structure.  KL  reconstruction  limits  the  leakage  best,  and  even  without  considering 
the  continuity  between  the  adjacent  sections  and  the  reconstructed  datagap,  it  promises  to 
provide  the  best  reconstruction  of  the  three  methods  described  in  this  paper. 
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